@Kuingsmile
Contributor

In this PR I made two fixes to the duplicate calculation:

  • Fix Bloom membership logic: combine the checks with AND across all hashes (isDup &= (ret & byte) != 0;) instead of overwriting isDup on each iteration (isDup = (ret & byte) != 0;). The old behavior effectively depended only on the last hash, which could produce incorrect duplicate results. I saw this error in the results of my RNA-seq data analysis.

  • Fix non–power-of-two masking at accuracy level 6: at level 6, mBufNum=6 and PRIME_ARRAY_LEN * mBufNum is not a power of two, so offset &= mask is not equivalent to modulo and causes biased indexing. I changed level 6 to use mBufNum=8 (mBufNum *= 4), making PRIME_ARRAY_LEN * mBufNum a power of two so the existing offset &= mask logic is correct.
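
For reviewers, here is a minimal standalone sketch of both issues. It is not the code from this PR: the Probe struct, the function names, and the toy sizes 6 and 8 are hypothetical stand-ins, and it assumes the mask is computed as size - 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one Bloom probe: the byte read from a buffer
// (called `ret` in the PR text) and the single-bit selector (`byte`).
struct Probe {
    uint8_t stored;
    uint8_t bit;
};

// Old behavior: the flag is overwritten on every iteration,
// so only the last probe decides the result.
bool isDupOverwrite(const std::vector<Probe>& probes) {
    bool isDup = true;
    for (const Probe& p : probes) {
        isDup = (p.stored & p.bit) != 0;
    }
    return isDup;
}

// Fixed behavior: the flag stays true only if every probe finds its bit set.
bool isDupAllHashes(const std::vector<Probe>& probes) {
    bool isDup = true;
    for (const Probe& p : probes) {
        isDup &= (p.stored & p.bit) != 0;
    }
    return isDup;
}

// `x & (size - 1)` equals `x % size` for every x only when size is a power of two.
bool isPowerOfTwo(uint64_t size) {
    return size != 0 && (size & (size - 1)) == 0;
}

// Counts how many slots in [0, size) can be reached through `x & (size - 1)`.
// Assumption: the mask is size - 1, as in a typical power-of-two table.
size_t reachableSlots(uint64_t size) {
    assert(size > 0);
    std::vector<bool> seen(size, false);
    for (uint64_t x = 0; x < 4 * size; ++x) {
        seen[x & (size - 1)] = true;
    }
    size_t count = 0;
    for (bool s : seen) {
        count += s ? 1 : 0;
    }
    return count;
}
```

With two probes where the first bit is clear and the last is set, isDupOverwrite reports a duplicate while isDupAllHashes does not. For the masking issue, reachableSlots(6) returns 4 because the mask 0b101 can only produce 0, 1, 4, and 5, while reachableSlots(8) returns 8.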

@sfchen sfchen merged commit fb04a1a into OpenGene:master Jan 14, 2026
1 of 2 checks passed
2 participants